Python Machine Learning Blueprints: Intuitive data projects you can relate to by Alexander Combs
Author: Alexander Combs
Language: eng
Format: azw3
Publisher: Packt Publishing
Published: 2016-07-29T04:00:00+00:00
Why do all of this? To obtain a high tf-idf value, a term needs to occur many times in a small number of documents. In this way, documents can be said to be represented by their terms with high tf-idf values.
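To make this intuition concrete, here is a minimal sketch that scores the terms of one sentence by hand. The three-document corpus is hypothetical (only "she ate lunch" comes from the text), and the weighting tf × log10(N / df) is an assumed textbook form. scikit-learn's TfidfVectorizer uses a smoothed, normalized variant by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus; only the second sentence appears in the text.
docs = [
    "i like the dog",
    "she ate lunch",
    "the dog ate lunch",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term appears in.
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(term, tokens):
    """Assumed textbook weighting: raw term count times log10(N / df)."""
    tf = tokens.count(term)
    return tf * math.log10(n_docs / doc_freq[term])

# Score every term in the second document.
scores = {term: round(tf_idf(term, tokenized[1]), 4) for term in tokenized[1]}
print(scores)  # {'she': 0.4771, 'ate': 0.1761, 'lunch': 0.1761}
```

The term "she" appears in only one document, so it scores higher than "ate" and "lunch", which each appear in two.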
With this framework, we'll now convert our training set into a tf-idf matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(ngram_range=(1, 3), stop_words='english', min_df=3)
tv = vect.fit_transform(df['text'])
With these three lines, we have converted all our documents into tf-idf vectors. Note that we passed in three parameters: ngram_range, stop_words, and min_df. Let's discuss each of these.
First, ngram_range determines how the document is tokenized. In our prior examples, we used each word as a token, but here we are using every one- to three-word sequence as a token. Let's take our second sentence, "She ate lunch." We'll ignore stop words for the moment. The n-grams for this sentence would be: "she", "she ate", "she ate lunch", "ate", "ate lunch", and "lunch".
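The n-gram expansion described above can be sketched in plain Python. The helper name word_ngrams and its regular expression are assumptions made for illustration, not scikit-learn's implementation; the pattern simply lowercases the text and keeps words of two or more characters.

```python
import re

def word_ngrams(text, min_n=1, max_n=3):
    """Return every word sequence of length min_n..max_n (hypothetical helper)."""
    # Lowercase, then keep tokens of two or more word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    ngrams = []
    for start in range(len(tokens)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(tokens):
                ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

print(word_ngrams("She ate lunch."))
# ['she', 'she ate', 'she ate lunch', 'ate', 'ate lunch', 'lunch']
```

A three-word sentence yields six tokens instead of three, which is why the vocabulary grows quickly as the upper bound of ngram_range increases.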
Next, we have stop_words. We pass in 'english' to remove all the English stop words. As discussed previously, this removes terms that lack informational content.
Finally, we have min_df. This removes from consideration all terms that don't appear in at least three documents. Adding this removes very rare terms and cuts down on the size of our matrix.
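As a rough illustration of what a document-frequency cutoff does, the sketch below keeps only the terms that occur in at least min_df documents. The function filter_by_min_df and the four tokenized documents are hypothetical; this shows the idea behind the parameter rather than the library's own code.

```python
from collections import Counter

def filter_by_min_df(tokenized_docs, min_df=3):
    """Keep terms that appear in at least min_df documents (hypothetical helper)."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # Use a set so a term counts once per document.
        doc_freq.update(set(tokens))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

# Hypothetical pre-tokenized documents.
docs = [
    ["she", "ate", "lunch"],
    ["he", "ate", "lunch"],
    ["they", "ate", "dinner"],
    ["she", "skipped", "lunch"],
]

kept = filter_by_min_df(docs, min_df=3)
print(kept)  # ['ate', 'lunch']
```

Of the seven distinct terms, only two survive the cutoff, which is the kind of reduction that keeps the tf-idf matrix manageable.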
Now that our article corpus is in a workable numerical format, we'll move on to feeding it to our classifier.
